High performance Beowulf computer for lattice QCD
We describe the construction of a high performance parallel computer composed of PC components, as well as
the performance test in lattice QCD.



1. INTRODUCTION 

The ZhongShan University Computational Physics group's interests
cover such topics as lattice QCD, quantum instantons and quantum
chaos. Most of these topics can be studied through Monte Carlo
simulation, but can be quite costly in terms of computing power. In
order to do large scale numerical investigations of these topics, we
required a corresponding development of our local computing
resources.

The demarcation between supercomputers and personal computers has
been further blurred in recent years by the high speed and low price
of modern CPUs and networking technology and the availability of low
cost or free software. By combining these three elements, all readily
available to the consumer, one can assemble a true supercomputer that
is within the budget of small research labs and businesses.

We document the construction and performance of a Beowulf cluster of
PCs, configured to be capable of parallel processing.

2. SYSTEM 

Our cluster consists of ten PC type computers, each with two Pentium
III-500 CPUs inside. The logic behind dual CPU machines is that one
can double the number of processors without the expense of additional
cases, power supplies, motherboards, network cards, et cetera. Also,
communication is faster between the two processors in the same box
than between separate computers. Each computer has an 8 GB EIDE hard
drive, 128 MB of memory, a 100 Mbit/s Ethernet card, a simple
graphics card, a floppy drive and a CDROM. In practice the CDROM, the
floppy drive, and even the graphics card could be considered
extraneous, as all interactions with the nodes could be done through
the network. One computer has a larger hard disk (20 GB) and a SCSI
card for interaction with a tape drive. For the entire cluster we
have only one console, consisting of a keyboard, mouse and monitor.

A fast Ethernet switch handles the inter-node communication. The
switch has 24 ports, so there is ample room for future expansion of
the cluster up to a total of 48 processors. Of course it is possible
to link multiple switches or use nodes with more than two processors,
so the possibilities for a larger cluster are nearly limitless.

We have installed a Red Hat Linux 6.1 distribution. It automatically
supports dual CPU computers. It is also able to support a Network
File System (NFS), allowing all of the nodes in the cluster to share
hard disks, and a Network Information System (NIS), which
standardizes the usernames and passwords across the cluster.

We can use the cluster for parallel processing by using the Message
Passing Interface (MPI), a library of communication functions and
programs that allow for communication between processes on different
CPUs. The programmer must design the parallel algorithm so that it
appropriately divides the task among the individual processors. He or
she must then include message passing functions in the code which
allow information to be sent and received by the various processors.

Machine        μsec/link    MB/sec
SX-4              4.50       45
SR2201           31.4        28
Cenju-3          57.42        8.1
Paragon         149           9.0
ZSU Cluster       3.98       11.5

Table 1

Performance of MPI QCD benchmark
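
The division of work and the explicit send and receive steps described above can be illustrated with a small sketch. It is not MPI: it uses Python threads and `queue.Queue` mailboxes as a stand-in, and the function name `run_ring` and the toy task (summing a slice plus one boundary value from each ring neighbour) are our own. In a real MPI program the two steps marked "Send" and "Receive" would be calls such as `MPI_Send` and `MPI_Recv`.

```python
import queue
import threading


def run_ring(values, n_workers):
    """Split `values` among workers arranged in a ring. Each worker returns
    the sum of its own slice plus the boundary value received from each of
    its two neighbours."""
    chunk = len(values) // n_workers
    inbox = [queue.Queue() for _ in range(n_workers)]  # one mailbox per worker
    results = [None] * n_workers

    def worker(rank):
        local = values[rank * chunk:(rank + 1) * chunk]
        left = (rank - 1) % n_workers
        right = (rank + 1) % n_workers
        # "Send": post our edge values to the neighbours' mailboxes.
        inbox[left].put(("from_right", local[0]))
        inbox[right].put(("from_left", local[-1]))
        # "Receive": block until both neighbours' messages have arrived.
        halo = dict(inbox[rank].get() for _ in range(2))
        results[rank] = halo["from_left"] + sum(local) + halo["from_right"]

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_ring(list(range(8)), 4))
```

Each worker touches only its own slice and learns about its neighbours' data solely through messages, which is the pattern a lattice code follows when it exchanges boundary sites.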

3. PERFORMANCE 

As we primarily developed the cluster for numerical simulations of
lattice QCD, we have also performed a benchmark which specifically
tests the performance in a parallel lattice QCD code. The algorithm
can conveniently divide the lattice and assign the sections to
different processors.

Hioki and Nakamura provide comparison performance data on SX-4 (NEC),
SR2201 (Hitachi), Cenju-3 (NEC) and Paragon (Intel) machines.
Specifically, we compare the computing time per link update in
microseconds per link and the inter-node communication speed in
MB/sec. The link update is a fundamental computational task within
the QCD simulation and is therefore a useful standard. The test was a
simulation of improved pure gauge action ($1 \times 1$ plaquette and
$1 \times 2$ rectangle terms) on a $16^4$ lattice. In each case the
simulation was run on 16 processors. The results are summarized in
Table 1.
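
As a rough illustration of what the per-link figure implies, the sketch below counts the links on a $16^4$ lattice (one link per site per direction, so $4 \times 16^4$) and converts a quoted cost in microseconds per link into the time for one pass over all links. The function names are ours, and treating the quoted figure as the elapsed time per link for the whole 16-processor run is an assumption, not something the benchmark description states.

```python
def links_in_lattice(extent, dims=4):
    """Number of gauge links on a periodic hypercubic lattice:
    one link per site per direction."""
    return dims * extent ** dims


def sweep_time_seconds(us_per_link, extent=16, dims=4):
    """Time to update every link once at a given cost in microseconds
    per link. Assumes the cost is elapsed time per link for the run."""
    return us_per_link * 1e-6 * links_in_lattice(extent, dims)


if __name__ == "__main__":
    print(links_in_lattice(16))  # 4 * 16**4 links
    for machine, cost in [("SX-4", 4.50), ("ZSU Cluster", 3.98)]:
        print(f"{machine}: {sweep_time_seconds(cost):.2f} s per sweep")
```

With the Table 1 values this gives roughly one second per sweep for the ZSU cluster.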

A widely used QCD program is the MILC code. It has timing routines
provided so that one can use the parallelized conjugate gradient
routine in the simulation as a benchmark. Furthermore, this code is
very versatile and is designed to be run on a wide variety of
computers and architectures, which enables quantitative comparison of
our cluster to both other clusters and commercial supercomputers. In
the MILC benchmark test we ran to a convergence tolerance of
$10^{-5}$ per site. For consistency with benchmarks performed by
others, we simulated Kogut-Susskind fermions.

We have run the benchmark test for different size lattices and
different numbers of processors. It is useful to look at how
performance is affected by the number of CPUs when the amount of data
per CPU is held fixed, that is, each CPU is responsible for a section
of the lattice that has $L^4$ sites. For one CPU, the size of the
total lattice is $L^4$. For two CPUs it is $L^3 \times 2L$. For four
CPUs the total lattice is $L^2 \times (2L)^2$; for eight CPUs,
$L \times (2L)^3$; and for 16 CPUs the total size of the lattice is
$(2L)^4$.
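
The scaling rule above (fixed $L^4$ sites per CPU, with one more lattice direction doubled each time the CPU count doubles) can be written down directly. This is a minimal sketch; the function name `total_lattice_shape` is ours and the ordering of the extents in the returned tuple is an arbitrary choice.

```python
import math


def total_lattice_shape(L, n_cpus):
    """Extents of the total 4-d lattice when each CPU holds L**4 sites and
    one further direction is doubled per doubling of the CPU count."""
    if n_cpus not in (1, 2, 4, 8, 16):
        raise ValueError("n_cpus must be 1, 2, 4, 8 or 16")
    j = n_cpus.bit_length() - 1  # number of doubled directions
    return (L,) * (4 - j) + (2 * L,) * j


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        shape = total_lattice_shape(8, n)
        # The volume per CPU stays fixed at L**4.
        print(n, shape, math.prod(shape) // n)
```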

Note that the falloff in performance with increased number of CPUs is
dramatic. This is because inter-processor message passing is the
slowest portion of this or any MPI program, and from two to sixteen
CPUs the amount of communication per processor increases by a factor
of four. Table 2 shows that for a lattice divided into $2^j$
hypercubes, each of size $L^4$, there will be $j$ directions in which
the CPUs must pass data to their neighbors. The amount of
communication each processor must perform is proportional to the
amount of interface per processor. As this increases, per-node
performance decreases until $j = 4$, when every lattice dimension has
been divided (for a $d = 4$ simulation); beyond that point the
per-processor performance should remain constant as more processors
are added. The shape of this decay is qualitatively consistent with a
$1/j$ falloff.
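
The interface counts quoted from Table 2 follow from two formulas: $j L^3$ sites of interface per CPU and $2^j j L^3$ in total. The sketch below evaluates them; the function name `interface_sizes` is ours.

```python
def interface_sizes(j, L):
    """Return (hypercubes, total_interface, interface_per_cpu) for a lattice
    cut into 2**j hypercubes of L**4 sites, using the Table 2 formulas."""
    hypercubes = 2 ** j
    per_cpu = j * L ** 3
    total = hypercubes * per_cpu
    return hypercubes, total, per_cpu


if __name__ == "__main__":
    for j in range(5):
        print(j, interface_sizes(j, L=8))
```

Going from $j = 1$ (two CPUs) to $j = 4$ (sixteen CPUs) multiplies the interface per CPU by four, which is the factor quoted in the text.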

Of course there are other ways to divide a four-dimensional lattice.
The goal of a particular simulation will dictate the geometry of the
lattice and therefore the most efficient way to divide it up
(generally minimizing communication). A four-CPU simulation using a
$4L \times L^3$ lattice has the four hypercubic lattice sections
lined up in a row (as opposed to in a $2 \times 2$ square for an
$L^2 \times (2L)^2$ lattice) and has the same amount of communication
per CPU as does the $L^3 \times 2L$ two-CPU simulation. In a
benchmark test the per-CPU performance was comparable to the
performance in the two-CPU test.
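
The comparison between the row layout and the square layout can be checked with a small generalization of the Table 2 counting: one $L^3$ face of interface per CPU for every direction in which the lattice is split. The function name `interface_per_cpu` and the `cpu_grid` tuple (number of CPUs along each of the four directions) are our own notation, and the count assumes periodic boundaries so that every CPU has neighbours in each split direction.

```python
def interface_per_cpu(cpu_grid, L):
    """Interface per CPU, counting one L**3 face for each direction that is
    divided among more than one CPU (the convention used in Table 2)."""
    split_directions = sum(1 for n in cpu_grid if n > 1)
    return split_directions * L ** 3


if __name__ == "__main__":
    L = 8
    print(interface_per_cpu((2, 1, 1, 1), L))  # two CPUs
    print(interface_per_cpu((4, 1, 1, 1), L))  # four CPUs in a row
    print(interface_per_cpu((2, 2, 1, 1), L))  # four CPUs in a 2 x 2 square
```

The row layout gives the same value as the two-CPU case, while the square layout gives twice as much.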

For a single processor, there is a general decrease in performance as
$L$ increases, as shown in Table 3. This is well explained in [11] as
due to the larger matrix size using more space outside of the cache
memory, causing slower access time to the data.

I.D.   H.      L.V.                       T.I.          I./CPU
$j$    $2^j$   $L^{4-j} \times (2L)^j$    $2^j j L^3$   $j L^3$
0      1       $L^4$                      0             0
1      2       $L^3 \times 2L$            $2L^3$        $L^3$
2      4       $L^2 \times (2L)^2$        $8L^3$        $2L^3$
3      8       $L \times (2L)^3$          $24L^3$       $3L^3$
4      16      $(2L)^4$                   $64L^3$       $4L^3$

Table 2

Boundary sizes for division of a lattice into 1, 2, 4, 8 and 16
hypercubes of size $L^4$. Here I.D. stands for interface directions,
H. for hypercubes (CPUs), L.V. for lattice volume, T.I. for total
interface, and I./CPU for interface/CPU respectively.

$L$    single processor speed (Mflops)
4      161.5
6      103.2
8       78.6
10      76.4
12      73.9
14      75.9

Table 3

Single CPU performance of MILC code.

For multiple CPUs there is an improvement in performance as $L$ is
increased. The explanation for this is that the communication
bandwidth is not constant with respect to message size. For very
small message sizes, the bandwidth is very poor. It is only with
messages of around 10 kB or greater that the bandwidth reaches the
full potential of the fast Ethernet hardware, nearly 100 Mbit/sec.
With a larger $L$, the size of the messages is also larger, improving
the communication efficiency. The inter-node communication latency
for our system is 102 μs. As inter-node communication is the slowest
part of a parallel program, this far outweighs the effect of cache
misses.
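
The dependence of bandwidth on message size can be illustrated with a simple cost model in which each message takes a fixed latency plus its size divided by the peak rate. This linear model is our assumption, not a measurement from the benchmark; it uses the 102 μs latency and the 100 Mbit/s (12.5 MB/s) fast Ethernet rate quoted above, and the function name `effective_bandwidth` is ours.

```python
def effective_bandwidth(message_bytes, latency_s=102e-6,
                        peak_bytes_per_s=12.5e6):
    """Bytes per second actually achieved when each message costs a fixed
    latency plus message_bytes / peak_bytes_per_s (assumed linear model)."""
    transfer_time = latency_s + message_bytes / peak_bytes_per_s
    return message_bytes / transfer_time


if __name__ == "__main__":
    for size in (100, 1_000, 10_000, 100_000):
        fraction = effective_bandwidth(size) / 12.5e6
        print(f"{size:>7} bytes: {fraction:.1%} of peak")
```

Under this model a 10 kB message reaches close to 90% of the peak rate, while a 100-byte message stays below 10%, which is the trend described above.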

To summarize, a parallel cluster of PC type computers is an
economical way to build a powerful computing resource for academic
purposes. On an MPI QCD benchmark simulation it compares favorably
with other MPI platforms. The price/performance ratio is $7/Mflop. It
is drastically cheaper than commercial supercomputers for the same
amount of processing speed. It is particularly suitable for
developing research groups in countries where funding for pure
research is more scarce. We have been doing large scale calculations
of the hadron and glueball spectrum.